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Abstract 

In this paper we present randomized algorithms for sorting and convex hull that achieves optimal 
performance (for speed-up and cache misses) on the multicore model with private cache model. Our 
algorithms are cache oblivious and generalize the randomized divide and conquer strategy given by 
Reischuk P3] and Reif and Sen [17]. Although the approach yielded optimal speed-up in the PRAM 
model, we require additional techniques to optimize cache-misses in an oblivious setting. Let p, n, M, B 

j» respectively denote number of processors, problem size, the size of individual processor cache memory 

and block size respectively, then we obtain expected parallel running time (9(^logn + log n log log n) 
with expected log M n) cache misses for sorting n keys and constructing convex hull of n points. 
For p < log "„.„ , and under the tall-cache assumption M > B 2 , both speed-up and cache- misses are the 
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best possible. Since the input-size n > Mp, under a very mild assumption M > log log n, p < log ™ - , 
so in all realistic scenarios, our algorithm will have optimal time and cache misses with high probability. 
Although similar results have been obtained recently for sorting [11] , we feel that our approach is simpler 
and general and we apply it to obtain an algorithm for 3D convex hulls with similar bounds. 

We also present a simple randomized processor allocation technique without the explicit knowledge of 
the number of processors that is likely to find additional applications in resource oblivious environments. 
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1 Introduction 



The private-cache multicore model and the closely related Parallel External Memory (PEM) model com- 
bines several features of parallel computing models like PRAM and the memory hierarchy issues captured 
by the External Memory Models. The goal is to capture the relevant aspects of a large scale multiprocess- 
ing environment, whose numerous parameters may be unknown to the algorithm designer. Although, it 
is not intuitive, this last requirement can be tackled using the strategy called cache obliviousness or more 
generally resource obliviousness. 

These multiprocessing models consists of p processors (or cores) each having a private cache of size M. 
There is a large global memory which acts as a shared memory. The processors communicate with each 
other only through shared memory. Initially the input is present in the global memory stored in form of 
blocks of size B. So an input of size n uses n/B blocks in the global memory. All the transfers to and 
from memory (or cache) are done in blocks of size B. Whenever a core needs some data, say x, if the data 
is already present in its private cache then no cost is incurred but if it is not present in the cache then 
it copies this block from the shared memory to its private cache. Similarly when a core wants to write a 
block, first that block is moved into its private cache then those changes can be updated in shared memory 
at that time or later using some write-back policies. When a core modifies a block in its cache, then all 
the copies of that block which are present in the cache of other cores are invalidated and for any future 
access to that block it needs to be read again from the shared memory. To obtain full parallelism we need 
to fill the cache of each core at least once, i.e. n > M ■ p. 

Thus we have two types of cache related-cost incurred in this modeQ 

1. Cache misses : This is the standard type of cache miss which also occurs in sequential computation. 
Whenever any core need some data which is not present in its cache, that block is copied from main 
memory to its cache. This is treated as one cache miss. If some block B\ is evicted from cache 
because of cache replacement policy and processor again need this B\ block, it will again cost one 
cache miss to access this block (also known as capacity misses). 

2. Block misses : These kind of misses occur only in the case of concurrent writing. It doesn't have 
an analogue in sequential computation and requires extra care in parallel computation. Suppose two 
cores C\ and C2 are accessing the same block B\ . If C\ modifies the contents of this block then the 
copy of B\ present in C2's cache is invalidated and it has to be read again which will be counted as 
one cache miss. Note that block misses increase when a larger number of cores are accessing the same 
block and one core modifies it. So the block misses occur when multiple processors are accessing the 
same block at same time and at least one of them is writing. An algorithm designer will try to avoid 
this situation as much as possible. 

Both concurrent reads and concurrent writes are allowed in this model. 

Concurrent Reads : If x cores are reading the same block, every core will incur one cache miss and the 
total cache cost will be x (unless some processor writes on this block that will be treated as a block miss). 
Concurrent Writes : We have two cases: 

1. When x cores are reading one block and one core writes to that block at same time. All these cores will 
have to read this block again from shared memory, which will lead to a total of x cache misses. 

2. If there are x cores and all of them want to write to the same block, this block will migrate across these 
x cores to satisfy their write requests. Thus the i th core in sequence will have to wait for time equal to i 
cache misses before it can complete its write operation. So the total cache misses across all cores can be 



In the worst case, any core will incur B cache misses to read or write on any block. So, while distributing 
problems between different cores, to save cache cost, we will try to avoid block misses. 

1 These terms have been previously defined in [11) 
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The performance of any algorithm is characterized by two parameters. 
Cache misses : The total number of cache misses and block misses incurred during algorithm. We will 
analyze the total cache cost wherever any block miss occurs. 

Critical path : It is the maximum time taken by any processor in the overall algorithm. To decrease the 
critical path, we have to ensure equitable work-load distribution. 

Note : Apart from its close relation to parallel time, this also affects the overall cache misses in a multi- 
programming scheduling algorithm based on work-stealing [7J. 

Further, based on our a priori knowledge of multicore parameters (M,B,p) we have the following 
variations 

Cache Aware : The algorithm designer makes use of the values of multicore parameters (M, B,p) - for 
example, we can divide the problem according to values of M and B to better utilize the cache. 
Cache Oblivious : In this model we know the number of processors while designing algorithm but the 
values of M and B are unknown. However, the analysis can make use of these parameters and our goal is 
to match the performance of Cache aware algorithms. 

Resource Oblivious Model : In this model, even the number of processors is unknown to the algorithm 
designer. Usually the tasks are shared by processors using some on-line strategy like work-stealing methods 

m 

1.1 Previous Related Work 

Our results make use of Reischuk's [H] parallel randomized sorting algorithm that sorts n elements using 
CREW PRAM processors in 0(log n) time with high probability. For the external memory model, Aggarwal 
and Vitter [5] designed a version of merge sort, that uses a maximum 0(^log M / B N/B) I/O's and this 
is optimal. For the cache oblivious model, Frigo et al. [13] presented a sequential cache-oblivious sorting 
algorithm called Funnel sort which can sort n keys in O(nlogn) time and 0(^j log^/ n) cache misse^J Note 
that both time and cache misses achieved by this algorithm are optimal. 

Arge et al. [3] formalized the PEM model and presented a merge sort algorithm based on [10] that runs 
in O(logn) time and has optimal cache misses. Note that their model is both cache aware and processor 
aware. This algorithm is very similar to merge sort implemented on [2] . They proved these results assuming 
that n > pB 2 by using a d-way merge sort which partitions the input into d subsets, then sorts each subset 
recursively and merges them. To achieve optimal I/O complexity and parallel speedup, the sorted subsets 
are sampled and these sample sets are merged first. Then using the values of M and B, a suitable value 
of d is chosen to achieve the claimed bounds. 

Blelloch et al. [7] presented a resource oblivious distribution sort algorithm that has (9(log 2 n) critical 
path-length and incurs sub-optimal cache cost in the private-cache multicore model. Their distribution sort 
uses merging to divide the input into contiguous subsets and sorts them recursively. For their randomized 
version, they proved an expected 0(log 1,5 n) depth. The potential bottleneck for reducing the depth further 
was their technique for using merging to partition into buckets. The algorithm given in [22] is designed 
for a BSP-style version of a cache aware, multi-level multicore. It uses a different collection of parameters, 
and so it is difficult to compare it directly with the previous results. 

Recently Cole and Ramachandran [11] presented a new optimal merge sort algorithm (SPMS) for 
resource oblivious multicore model. This algorithm sorts n keys in 0(-logn + log n log log n) time using 
n processors with optimal number of cache misses on resource oblivious model assuming n > Mp and 
M > B 2 log B log log B. It works in O(loglogn) depth where each stage requires O(logn) time. The 
authors addressed a general computational paradigm called Balanced Partitioning Trees and designed a a 
resource-oblivious priority work scheduler based on work-stealing to attain the above bounds. It follows 
the same approach as merge sort where we splits the input into subsets, sorts the subsets recursively and 
then merges them (to achieve the desired bounds, the authors used multi-way merging). 

2 For tall cache, since log M n is 0(log M//s n/B), we will use the shorter expression 
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As sorting is basic problem in algorithms similarly Convex hull problem is a basic problem in Compu- 
tational geometry. There is a close relation between sorting and convex hull problem. It has been proved 
that sorting can be reduced to convex hull problem. So it means we can't solve convex hull problem faster 
than sorting. On a single processor convex hull problem can be solved easily by reducing it to sorting. 
For 2-D convex hull problem, on single processor, there exists a output-sensitive cache oblivious algorithm 
which achieve optimal time and cache misses [JJ. By saying output sensitive hull, we means that total time 
is proportional to number of points in output of convex hull. This algorithm is a modification of Chan's 
output sensitive algorithm [8]. On CREW model, there exists an algorithm which solve 3-D convex hull 
problem in expected O(logra) time using n processors [T7] , 

In the table below, we summarize the results discussed above where is used to represent expected bound. 



To sort n elements using p processors 
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1.2 Our Work 

In this paper, we have presented a randomized distributed sorting algorithm on cache-oblivious multicore 
model. Our basic approach is similar to Reischuk |14j . but to bound cache misses, we had to modify it 
significantly. First, we sample some elements from input, sort them using a brute- force method and use 
these elements to divide the input into disjoint buckets. To do this partitioning with minimal cache misses 
in a cache-oblivious fashion, we do it in two phases. Roughly speaking, we divide the n input keys into 
y/n buckets by successive partitioning into n 1 / 4 size buckets using a common procedure. This partitioning 
procedure, which is the crux of our distribution sort, in turn invokes an efficient merging procedure to 
attain the final bounds. 

We are able to sort n elements in expected 0{~ log n + log n log log n) time with expected ^0 
cache misses. These cache misses match the optimal cache bound 0(^logM for tall cache i.e. if 

M > B 2 . So the cache cost is optimal but time is optimal only for p < (n/ log log n). From the natural 
condition that n > Mp, if M > log log n it implies that p < (n/ log log n) and our algorithm will have 
both time and cache misses optimal. Our bounds for sorting match that of Cole and Ramachandran 
in cache-misses and depth and we can obtain matching performance in the cache-oblivious PEM model. 

3 It didn't account for extra costs like block misses 
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We also present a simple technique for processor-obliviousness where the algorithm need not have any 
prior knowledge of the number of processors and the processors can generate their ids on the fly. Our 
approach is fundamentally different from [7] that designs a scheduler to map tasks to processors in a 
resource-oblivious fashion. 

In practice, size of cache is 32KB or 64KB and size of block is 256B it means M > log log n condition 
is satisfied for input < 2^ 2 K So our algorithm have optimal time and cache cost with high probability 
for input of size < 2^ 2 \ 

Moreover, our approach yields a general framework for randomized divide-and-conquer that has other 
applications. In the latter part of this paper, we present a parallel multicore algorithm for constructing 
convex hulls of point sets. This is based on the approach of of Reif and Sen [IT] but we follow a simpler 
description given in [19J for planar hulls. We are able to apply a number of subroutines developed for the 
parallel sorting algorithm for the construction of the convex hull and achieve a bound similar to sorting. 
Since it is known that Voronoi diagrams can be reduced to three dimensional convex hulls, we obtain 
identical results as given above for sorting. 

2 Some basic algorithms 

The reader may recall that we are designing these algorithms without using the parameters M and B. 
First we mention some basic algorithms which we will need for our sorting algorithm. We will use T to 
represent total time (total operations), T to represent expected time, T" to represent parallel time, T" to 
represent expected parallel time and Q to represent total cache misses. 

Some times processors may have to read or write in contiguous locations which can lead to block misses. 
So we bound total cache misses as follows : 

Contiguous Reads and Writes. We have p cores and a total of n keys in the main memory stored in 
n/B blocks and every processor has to read equal number of keys. So the total cache misses for every 
core will be [pj] • Since a core may start or end reading in the middle of any block, the total cache misses 
across all cores should be < n/B + 2p which is 0(n/B) given our assumption that (n > Mp). Similarly, 
when p cores have to write a total of n keys in n contiguous locations and every processor has to write an 
equal number of keys n/p, then there will not be more than two block misses per processor for n > Bp. 

We now state results for some basic problems used throughout this paper. The associated algorithms 
are based on standard approaches and the bounds assuming that n > Mp. 

Lemma 2.1 Using p cores, maximum of n keys can be found in time 0(n/p + logp) with total 0(\n/B~\) 
cache misses. 

Proof: Assume input has been divided into p contiguous chunks each of size n/p. Every core is assigned 
a task to get a maximum element from one chunk. This will take \n/p~\ time and n/B + p cache misses. 
We get p elements. Use p/2 cores, where every core will read two keys and reject one key. This step will 
take 0(1) time and p cache misses. Repeat the same procedure using p/4 cores and so on until we are left 
with one input key. Total time for whole algorithm will be 0(n/p + logp) and total cache misses across all 
cores will be 0(n/B + p) which can be bounded by 0(n/B) if n > Bp. □ 

Lemma 2.2 Using p cores, n keys can be added in time 0{n/p + \ogp) with total 0{n/B) cache misses. 



Proof: This task can be done using the same approach as used in Lemma 2.2 within same bounds. □ 



Lemma 2.3 Using p cores, prefix computation on n keys can be done in time 0{n/p + logp) with a total 
of 0{n/ B) cache misses. 
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Input : (ki, k 2 , ■ ■ ■ , k n ) 

n 

Output : (fci, ki + k 2 , h + k 2 + k 3 , • • • , &i) 

i=l 

Proof: We will use an algorithm which does prefix computation on n keys in 0(n) time and 0{n/B) 
cache misses using one processor. [7]. Due to its Divide and Conquer nature, this algorithm can be easily 
adapted to run on more than one processor. To bound the block misses, we need to makes some small 
changes. 

We are given an input array A of length n, a temporary array S of length n — 1, and an output array R of 
length n. Say K=(n/p). Consider a balanced binary tree laid over input as shown in figure [l] (for n = 8). 
This balanced tree will be of size n — 1 and will be saved in array S. To make memory operations more 
cache efficient and avoid block misses (explained later) , save this tree in infix order in S as shown in figure 

m 




Figure 1: For n = 8, A is input array and S is temporary array 

This algorithm will work in two phases. In first phase, we will calculate the sum of the left subtree for 
each node and in second phase we will push this sum down the tree to calculate prefix sum. The recursive 
codes for the two phases are shown below : 

Phase 1: phasei(i, size). A is input array and S is temporary array where output of this phase 1 is stored. 

(a) if size = 1 then return A[i\. 

(b) S[i + size/2] = phase\{i, size/2) 

(c) return S[i + size/2] + phases\{i + size/2, size/2) 

Phase 2: phase 2 (i, size, s). Output of this phase is final output and is stored in array R. 

(a) if size = 1 then set R[i] = s and return. 

(b) phase 2 (i, size/2, s) 

(c) phase 2 (i + n/2, size/2, s + S[i + n/2]) 

Processor Allocation : After every level, no of processes get multiplied by two. So we know how 
many processes we would have at any time which means calls can be distributed easily to processors using 
processors numbers. Processors do not need any other information to execute these calls. Thus we do not 
have to spend any extra time in processor allocation. 

Analysis : Both phases use the same approach and same time and cache misses. We will analyze phase\ 
and same result follows for phase 2 also. Assume K=n/p. Each call to phase\ is divided into two calls 
of half sizes after constant computation and these both calls can be executed on different processors in 
parallel at same time. But if size of problem is < K then we have enough subproblems such that every 
processor can solve one problem individually and sequentially. So if size < K, all processes created by this 
call will be executed on same processor. This modification helps us to bound total number of cache misses 
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when size becomes below M, total cache misses will be bounded by M if the problem is solved on single 
core. 

Block Misses : 

Phase 1: Block misses occur when more than one processors are trying to write on same block. All written 
operations in phase 1 are done only on S array. Because of the way binary tree is stored in S, we are able 
to bound block misses. During phasei we are basically traversing tree stored in S from top to bottom 
(Figure [TJ where very node represents one call and two children represents two newly created sub-calls. 
We write one value on array S and create two new sub-calls to left and right subtree. All nodes of tree 
which are present at same level are being executed in parallel and each node is writing on one location in 
S. Distance between any two written positions will be equal to size of nodes on that level. So block misses 
are zero till problem size is > B. And when problem size goes < n/p, it is executed on single processor. In 
this subtree every processor is going to write contiguous elements (more than M). And when a problem is 
being solved on one processor than all calls are executed in depth first order which means that every core 
will write its elements in order from left to right in S which gives us zero block misses. So block misses are 
zero when problem size is greater than B or less than n/p. On the whole, we will have zero block misses. 
Phase 2 : In phases also we are traversing through a binary tree in which only leaves are taking part in 
writing. So similar to phasei, block misses are zero if n > Mp. 

Time and Cache misses : Say T(n,p) represents total time, T"(n,p) represents parallel time and Q(n,p) 
represents total cache misses taken by phasei for input of size n using p cores. 



T(n,l) 



Q(n,l) 



1 if n = 1 

O(l) + 2T(ra/2, 1) otherwise 

'n/B i£n<M 

O(l) + 2Q(n/2, 1) otherwise 



Solving these both equations, we get T(n, 1) = O(n) and Q(n, 1) = 0(n/B). 
Say K = n/p, where K > M 



T"(n,p) 



Q(n,p) 



In if n < K 

\o(l) + T"(n/2,p/2) otherwise 

(n/B if n<K 

\ O(l) + 2Q(n/2, p/2) otherwise 



Solving these both equations, we get T"(n,p) = 0(n/p + logp) and Q(n,p) = 0(n/B). 

This concludes the proof of Theorem |2.3| □ 



Lemma 2.4 Using p cores, a matrix of size m x n can be transposed in time 0(mn/p + logp) with total 
0(mn/B) cache misses given mn > Mp. 

Proof: As given in [p3]], Using one processor, we can transpose any matrix of size rnxnin 0(mn) 
time and 0{mn/B) cache misses. We will implement the same algorithm on multi-core model with slight 
variations to bound block misses. Given that matrix-es are saved in row major layout, the algorithm can 
be described with this simple recurrence relation : 

(0(1) if Max(m,n) < c 

0(1) + 2T(m,n/2) if n > m 

0(1) + 2T(m/2, n) otherwise 

where c is some constant. 
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1. If both m and n are constant, transpose it directly using some brute-force approach. 

2. Else, divide the whole matrix into two equal halves according to larger dimension. 
Implementing with p processors : Say K = mn/p and as given K > M. 



T"(m, n,p) 



' 0(mri) if mn < K 

0(1) + T"(m,n/2,p/2) if n > m 
0(1) + T"(m/2,n,p/2) otherwise 



which gives us 0{mn/p + logp) parallel time. 

After every level, both, size of problem and one of the dimension (row or column) is divided by two. As 
we keep on dividing the problem, at some level, mn becomes < M and whole problem can fits into cache. 
Because of dividing nature of this algorithm, at that time one of the dimension will be > \fM/2 and other 
will be > \[M. 

Processor Allocation can be done using same method as explained in Prefix Sum Computation (Lemma 



Cache Misses : 



2.3|) without any extra cost. 

Q(m,n,p) 



0(mn/B) if mn < K 

0(1) + 2Q(m,n/2,p/2) ifn>m/4 
0{1) + 2Q(m/2,n,p/2) otherwise 



which comes to Q(m,n,p) = 0{mn/ B). We have changed condition from n > m to n > m/4 to make 
sure that when every processor gets its individual problem, no. of rows are > no. of columns in that 
sub-problem. This will help us to bound block misses. 

Block Misses : After dividing given problem into enough sub-problems such that every core get its 
individual sub-problem, say of size x x y, where we know that xy = mn/p > M. 
Depending on m we have two cases: 

Case I m > B 

Because of dividing conditions of this algorithm, we can claim that x > \/~M. While writing transpose 
of matrix x x y in output, processor will be writing x contiguous elements y times which will incur 
zero block misses, if x > B. 

Case II m < B 

When dividing given problem into sub-problems, we will always divide the matrix by column side 



A[l,l] A[l,2] . . . A[l,n/p] 
A[2,l] A[2,2] . . . ARn/p] 

A[m,l] A[m,2] . . . A[m,n/p] 



A[l,n/p+l] A[l 1 n/p+2] . . . A[l,2n/p] 
A[2,n/p+l] A[2,n/p+2] . , . A[2,2n/p] 

A[m,n/p+l] A[m,n/p+2] , . . A[m,2n/p] 



A[l,2n/p+l] A[l,2n/p+2] . 


. ,A[l,3n/p] 


A[2,2n/p+l] A[2,2n/p+2] . 


. . A[2,3n/p] 


A[m,2n/p+l] A[m,2n/p+2] . 


. A[m,3n/p] 



Pi 



Input matrix (m * n size) 



Figure 2: Each core gets matrix of size m x n/p 
till every processor get one sub-problem. Every processor will have a sub-problem of size m x n/p 

8 



which is > M as shown in figure [2| Given that these matrix are saved in row major form, it can be 
easily seen that keys given to every core will be stored in contiguous locations in output matrix as 
shown in figure [3j And every processor have > M keys, so we will not have any block misses. 



A[l,l] A[2,l] ■ ■ ■ A[m,l] A[l,2] A[2,2] . . . A[m,2] . . . A[l,n/p] A[2,n/p] , , , A[m,n/p] 



Initial part of output matrix written by Pi 



Figure 3: Each core gets write mn/p contiguous elements 



This concludes the proof of theorem 2.4 



□ 



Lemma 2.5 Given n keys, rank of any key can be found in time Ol 2 + logp j with total 0(n/B) cache 
misses using p cores, given n > Bp. 

Proof: The whole task is divided into p processors such that every processor will find rank of given key 

among \n/p\ keys. It will take \n/p] time and total xp + pj = (n/B + p) cache misses across all p 
cores. 

Ranks found by all p cores are added to get the final rank of key among all n keys. For addition, we will 

use only p 2 /n cores(to avoid block misses). As p < n is given, so p 2 /n will always be < p. 

Using p 2 /n cores, p elements can be added in 0{n/p+\ogp) time and 0(p/B) cache misses, given p > Bp 2 /n 



or n > Bp (Theorem 2.2.) Total cache misses are bounded by 0(n/B), as p < n. □ 



Lemma 2.6 Using p cores, n elements can be written in contiguous locations in 0(n/p) time and 0(n) 
cache misses, if n > Bp. 

Proof: The crux of this step is to avoid block misses. For that reason we make sure that every processor 
gets more than B contiguous locations to write its keys. This lemma is generally used in cases when 
processors work on disjoint problems, get some output and finally write all those outputs at contiguous 
locations. 

Every processor is assigned an equal number of keys to write. The first processor reads the first \n/p\ keys 
from the input and writes them in the first n/p locations of the output. Similarly the i th processor reads 
the i th block of n/p keys from the input and writes in contiguous locations starting from {n/p X (i— 1) + l) th 
location of the output. It will take \n/p~\ time and n cache misses to read and write all keys. Total block 
misses are zero if n/p > B. □ 



2.1 Brute- force Sorting 

Lemma 2.7 Using Brute-force approach, n keys can be sorted in o(^- + logpj time using p cores with 
total cache misses 0(n 2 /B + n), given n 2 > Bp. 

Proof: To sort these n keys, rank of every key is found in parallel using (p/n) processors. 

n i £ ] _ ( 

B ' n I \ B 



It will take O [ \ + logp I time and total nx ( ^ + 2 = l^-j-p) cache misses across all p cores (Lemma 



2.5). We have one output array B of length n where every input key will be written at position given by 
its rank. To avoid block misses, use the following strategy. 

Create one more output array A of length n 2 and write the i th key at the location n • Ri of A (where Ri 
represents rank of i th key.) This is done in \n/p] phases, by writing p keys in every phase. In the first 
phase, all those cores whose keys have ranks cn/p (where c is to n — 1), will write their keys and rest of 
the cores will remain idle. It will take 0(1) time, p cache misses and zero block misses if n 2 > Bp. In the 
next phase, all those cores whose keys have ranks cn/p + 1, (where c is to n — 1) will write their keys. In 
total, we will have total of n/p such phases and a total of 0(n) cache misses across all phases. 



2.6 



Finally, array A will contain all n input keys spread-ed in n 2 locations in sorted order. Using Lemma 
we can write these n keys at continuous location using \p/n\ processors in 0(n 2 /p) time and 0(n) cache 
misses, if n > Bp/n or n 2 > Bp. □ 



Corollary 2.1 Using brute-force approach, n keys can be sorted in time O ^ ^ +logp j with total 0(n 2 /B) 
cache misses using p cores, given n 2 > Bp and n > B. 



Proof: This is a simple instance of lemma 2.7 when n> B. □ 



2.2 Sampling 

Lemma 2.8 Given a set A of n input keys and by using p cores, where n > Mp, a set S of size n l l x 
(where x > 4) can be chosen from A in 0{n/p + \ogp) time and 0(n/B) cache misses such that S satisfies 
the following property. 

Suppose to, . . . ,i/-fi is one of the longest sorted subsequence in sorted A such that io>i/+i £ S but 
h, t2, ■ ■ • , tf ^ S. 

Then Pr(f > (1 + t- 1 / e )n 1 - 1 / x = (t 1 ^* 1 ^)- 1 , where t = ^jn x l x . 

Proof: This is done using the sampling technique given by Reif and Valiant [15] . 

1. Choose another set S* of size n 1 / 2 from A randomly. The probability for any element being chosen 
in S* is equal. To choose S* , imagine that input has been divided into n 1 ' 2 contiguous chunks all of 
equal size and one key is selected randomly from every chunk. It will take {\y/n/p~\) parallel steps 
and total n 1 / 2 cache misses which is bounded by n/B if n > B 2 or n > M and m > B 2 . 



2. Using Lemma 



2.1 



S* can be sorted in time Ol ^ +logpJ using p cores with total O(^) cache misses 
if n > Bp and ^Jn > B which is true as n > Mp is given. 

3. To get required set S of size choose l(y/n/n 1 ^ x ) th , 2(^/n/n 1 / x ) th , . . . , y/n th element from S* . It 
will take maximum ( \n l l x /p~\ ) parallel steps and total n l / x cache misses which is bounded by n/B if 
n > B 2 . 

4. Write these [~n 1//:r ] elements at contiguous locations using \pn}l x /n\ cores in \n/p~\ time and \n/B~\ 
cache misses. 



From the results in [15], we can claim that Pr(f > (l + t 1//6 )n 1 x l x = (t 1 / 2 2^ 1/2 ^) 1 , where t = y/n/n 1 ^ 



. □ 



3 Randomized Divide and Conquer 

Our basic framework for randomized divide-and-conquer is exemplified by Resichuk's [13] parallel sorting 
algorithm. We are given n keys in beginning A = (ki, A;2, • • • , k n ). The basic steps of algorithm are : 
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1. (Random Sampling to divide) Choose a subset S of size \fn from A randomly. 

2. Sort S by comparing every pair of keys in S. 

3. (Partitioning Step) Using S, divide A into y/n intervals Bq, B\, ... , B^ where Bi contains those 
keys that are bigger than i th smallest but not bigger than (i + l) th smallest element of S. 

4. (recursive call) Sort each box recursively independent of other boxes. 

The above algorithm runs in O(logn) steps with high probability in an n processors CRCW PRAM model 

HI- 

3.1 Adaptation into Multicore 

We now present a variation of the above algorithm on Multi-core cache oblivious model to achieve the 
same kind of bounds as the PRAM model. We will use the following notations for the various performance 
measures: 

T : total time (total operations) T : expected time T" : parallel time T" : expected parallel time 
Q : total cache misses Q: expected cache misses. 

Theorem 3.1 Using p cores, we can sort n general keys in expected time log + log n log log n) and 
expected cache misses 0(^\og M n), assuming M > B 2+a where a = 2/31 and p < jfe. 

Proof: We are given a set A of n keys (k\, &2j • • • > k n ) in shared memory. We have p cores each with cache 
size of M. 

Our algorithm divides the problem into disjoint subproblems recursively until we have the number of 
subproblems matching the number of processors. Every subproblem is assigned processors according to its 
size so that (n/p) ratio is same for every sub-problem. As all the subproblems at same level have nearly 
equal sizes, so when problem size becomes lower than this ratio then it will mean that we have enough 
number of subproblems. (Note that we know p.) 

We represent the initial input size as N and total processors given as P. After every step we check whether 
problem size < (N/P) if yes, each subproblem is assigned to an individual core otherwise partition it 
further into smaller subproblems. 
The algorithms is executed in following steps : 



Distribute n keys into m buckets 



Assuming input set can be represented 

by Vn continuous chunks. Distribute 
each chunk into Vm buckets recursively 



Merge these Vn chunks to get a single 
list divided intoVm buckets 



Distribute each bucket into Vm buckets 
recursivelv 



Figure 4: The overall recursive partitioning scheme 
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1. If the problem size < (N/P) then solve this problem on one processor sequentially. 

Note that n keys can be sorted using one processor in time 0(n\ogn) with 0(% ^og M n) cache misses (113/) 



We choose a set S of size [n 1 /^] , where x > 4 from A, using Lemma 2.8 This task can be done in 0(n/p+\ogp) 
time and 0(n/B) cache misses using p cores, given n > Mp. We will decide the value of x later during the 
analysis. 



3. Using S, divide A into |5| buckets B\, B2, . ■ . , -Bigi where Bi contains those keys that are greater than 
smallest but not greater than (i + \) th smallest element of S. 
In other words, we have max {x | x G Bi} < Si < min {x | x £ Bi + \\. 

For this step we can follow the same approach as used in Reischuk sorting |14j . where we assign every key to 
one processor and let it find the bucket number of this key using binary search on given 15*1 bucket indexes. 
Then we do integer sorting to move all the elements to the right buckets. But even binary search step could 
cost us n\og \S\ cache misses where our target is § log M \S\ cache misses. To avoid this we use the following 
strategy 



Using Theorem 5.1 we can do this task using p cores in time 0(^logn + log n log log n) with a total of 



0(-g log M n) cache misses, provided \S\ < ^fn, P < jj and n > B 2 ^ 1 log ™ ^ s ^p. As \S\ = n Y l x , this condition 
can be reduced to n > B 2 l^ 1 l x) p or n > B 2 B 2 ^ x ~ 1 ^p. 

The original problem has been divided into [n 1 / 21 ] subproblems where each subproblem can be solved inde- 
pendently. We claim that the size of largest subp roble m will be < (1 + i -1 / 6 )™ 1-1 / 21 with very high prob- 



ability (which is 1 — l/(t 1 / 2 2( t ' using Lemma 2.8 where t = ^fn/n 1 ! 1 . Therefore, a subproblem size 
is < (1 + n ( 2 - x )/ 12x ) n l - 1 / x w.h.p. otherwise, we repeat the partitioning step again starting from step two. 
Distribute processors between subproblems according to their sizes and these subproblems are solved again 
recursively using this same procedure. For the n x / x subproblems, processors are assigned to a problem such 
that the (n/p) ratio is same. For example, if the size of the i th problem is rii then it gets pi — pni/n processors. 

We will do prefix computation for the processor allocation. This takes o( ^ + logpj time using p-^— cores 
with total cache misses 0{n^ x /B), given n x l x > M ( p rL - 1 — ] or n > Mp. This concludes step 4. 



Recall that to achieve the claimed bounds in steps 2 to 4, we require ^ > max{M,B 2 B 2 ^ x 



3.2 Detailed Analysis 

We represent the initial input size as N and the total number of processors given as P. As ratio n/p remains 
same throughout the algorithm, so if it is given that f > max{M,B 2 B 2 ^ x -^}, condition given above will 
always be satisfied till the problem size > (N/P). When problem size becomes N/P, it means we have enough 
number of sub-problems so that we do not need to run steps 2 to 4, we solve the problem directly using step 1. 
We can summarize the performance of our algorithm as : 

Given, f > max{M, B 2 B 2 ^ X ~^}, say K = N/P, let r be the probability of bad case (when size of largest 
sub-problem is > (1 + n( 2 ~ x ^ 12x )n 1 ~ 1 / x ), x > 2 and P < N . The expected parallel time satisfies 



T"(n) = \ °{p l0g " + ksnloglognj + (1 - r)T"(m) + rT"(n) if n > K 
y 0(K log K) otherwise 

For n > 2 18 , we can choose x s.t. rij < n 1 ~ 1 / 2x . Since r < 1/n, the previous recurrence can be rewritten as 

f"(n) = i°( Klo & n + lo S nlo S lo S n ) +T"(n l ) if n > K 
[ O (K log K) otherwise 



It follows that T"(n) = 0\^x^\ogn + x log n log log n j implying T"(n) = ^logn + log n log log n ). We 

have the following two cases depending on number of processors. 
Given n > Mp and n > B 2 B 2/{x ~ 1 '>p where T"(n) = 0(~ \ogn + log n log log n) . 



also known as semi-sorting 
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(a) If M > log log n or B 2 B 2 ^ X ^ > log log n then Mogn > log n log log i 
so f"(ri) = 0(|logn). 

( b ) If logTogn then f lo S n > log n log log n 
so T"(n) = 0(|logn). 

Likewise the total expected cache misses for all p processors 



Q(n) 



0(f log M n) + ^QK) ifn>^ 

i=l 

0(ji logjvf K) otherwise 



(3) 



This yields Q(n 4 ) = 0(^± log M n*) + ^ Q(n u ). 

i=l 

Since m < n 1 ^ 1 /^ and ^ n, = n, ^ log M m) < log M n {1 ~ 1/32x) ). It follows that Q(n) 



rii = n 

i=l i=l 

log M n) implying Q(n) = 0(f log M n 



Notice that the bounds on time and caches misses are expected over the choice of the random sample. We 
have chosen x = 32, so if we assume that M > B 2 B 2 / Z1 then the condition § > max{M, B 2 B 2 ^ X ^^} can be 
simplified to N > MP. This concludes the proof of Theorem |3.1| 

□ 

Remark : Using the analysis in |151I17|. we can obtain high probability bounds for parallel time and cache 
misses. 

In the next subsections, we sketch some details of the individual steps. 



4 Merging algorithm 



Input ( Total xy keys ) 




Bn 


B 12 




Blm 


B 21 


B 22 




B 2 m 




Bxl 


B*2 




Bxm 



Bu 


B 2 i ■■■ B xl B 12 B 22 ... B x2 


■■■ B lrr B 2 m 


... B xnn 




Output (Total xy keys ) 










xy/ p keys 
written by pj 


xy/ p keys 
written by p 2 


xy/ p keys 
written by p p 





Figure 5: Each core write contiguous xy/p elements in output 
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4.0.1 Basic structure of the Merging algorithm 

Given x lists each of size y and all divided into m buckets, merge all these lists to get one single list divided 
into m buckets. 

Let Bij represent the i th list and j th bucket where 1 < % < x and 1 < j < m. As shown in Figure [5J 
initially we are given input in form of B u , B 12 , B lm , B 21 , B 22 , ■ ■ ■ , B 2m , ... , B x i, B x2 , . . . , B xm 
and p processors. We will divide the input into p parts such that each core will get to write equal number 
of keys and in a manner that in the output, the first processor will write the first ^ keys, the second 
processor will write next ^ keys and so on. These processors may have to read their keys from different 
locations but while writing, they will write contiguously. To find the keys, every processor will need the 
information about the size of every bucket of every list. This will be done using prefix sum computation. 
So whole task can be divided into three steps : 

1. Every core needs to get the information about xy/p keys it must write. This is done using prefix 
sum computation. 

2. Every core will read exactly xy/p keys using the information generated in first step. 

3. Every core will write its xy/p keys in contiguous locations. 

4.0.2 Detailed explanation of Merging 

Theorem 4.1 Given x lists of total size y, that are indexed by t buckets, we can merge these lists into a 
single list partitioned into t contiguous buckets in time O^+logp^j using p processors incurring 0(j^+xt) 
cache misses, provided y > Mp and y > xt. 

Proof: Let T merge (x,y,t) represent the total time taken and Qmerge(x,y,t) represent the total number of 
cache misses to merge x lists of total size y and all divided into t buckets using p processors. 



Every core needs to know all buckets size of every list to find y/p keys it has to write. There are a 
total of xt such sizes. This information can be processed by first transposing given array of size x x t 
and then doing prefix sum computation on these xt elements. 

Using (xpt/y) cores, both of these tasks can be done in time 0(y/p + logp) and 0(xt/B) cache 



misses, given xt > M (xpt/y) or y > Mp (Lemma 2.4 and 2.3). We have enough cores for this task if 

y > xt. 

2. From the information found in Step 1, we need those sizes which have values y/p,2y/p, . . . ,py/p. 
Every core is assigned a task to find these sizes from xt/p sizes. Create an output array of length 
y, and if the size cy/p is found, it will be written at location cy/p of this output array. For reading 
and searching, it will take \xt/p~\ time and +p] cache misses. For writing, the total time will be 
\xt/p~\ , and the total cache misses are p (and zero block misses if y/p > B). 

3. These p points found in Step 2 are written at contiguous locations in \y/p~\ time with 0(p) cache 
misses, using p 2 /y cores, given p > Bp 2 /y or y > Bp (Lemma 2.6). 



Every core will read exactly y/p keys which will take \y/p\ time. 

Now we will bound total number of cache misses to read these y/p keys. All the processors will read 
their input keys sequentially unless two event happens as explained below : 

(a) The bucket that core was reading, ends. So in this case the processor may have to go to next 
list or next bucket which can increase cache miss count by at most one over sequential misses 
and this can happen only xt times. 
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(b) Processors may have to start or end reading in the middle of some block which can increase 
cache miss count by at most by one or two over sequential misses. This can happen a maximum 
of p times. 

The total cache misses can be bounded by 0(y/B + xt + p) which is 0(y/B + xt) for p < y/B.. 

5. Every core write its y/p keys sequentially that leads to a total of 0(y/B) cache misses. As > p, 
every core has more than one block to write and since every core is writing sequentially so there will 
not be any block misses. 



This concludes the proof of Theorem 4.1 . □ 



The following is a simple corollary of the above Theorem. 

Corollary 4.1 Given x lists, each of size y, that are partitioned into t buckets , we can merge these lists 
into a single list partitioned into t buckets in time o(^^ +logp^ using p processors with total + xt) 
cache misses, given xy > Mp and y >t. 

Corollary 4.2 Given x lists, of total size Y , we can merge these lists into a single list in time O^^+logp 
using p processors incurring a total of + x) cache misses, given Y > Mp. 

Proof: By merging, we imply writing the input lists in contiguous locations without changing the order of 
elements, i.e., L±, L2, -L3, . . . , L x . This Lemma is used in cases when different processors work on different 
problems and generate some output, and we need to write the output in a sequence. So every list can be 
thought of as one bucket and the final list is also one list divided into one bucket. This can be done by 



using Theorem 4.1 Given x lists of total size Y, we can merge these lists into a single list (divided into 



one bucket) in time ^(33 + logpj using p processors incurring a total of 0(]j + x) cache misses, given 

Y > Mp and Y > x. Note that Y is the total size of x lists and every list has at least one element so 

Y > x. □ 



5 Dividing input into buckets 

5.0.3 Basic structure of algorithm 

Here we give a brief description about algorithm and later follow up with detailed time and cache analysis. 
Given n keys and m bucket indexes X\,X2, ■ ■ ■ ,X m (given in sorted order), divide n keys into m buckets 
B\, B2, . . . , B m such that Xi < {x \ x S Bi} < Xj+i. 

Let T(n, m) represents the total time to divide n keys into m buckets. 

First we divide n keys into yfm buckets with splitters (X^, X 2y ^, ■ ■ ■ , X^^). For every i, 1 < i < y/m, 
B. L is again divided into Bn,Bi2, . . . , -Bj v ^; as shown in figure [6] : 

1. Divide the n keys into y/m buckets. This is done in two steps. 

(a) Divide n keys into y/n contiguous chunks of size y/n each. Now each chunk of size y/n is divided 
into yfm buckets recursively. 

(b) We get y/n lists, each divided into yfm buckets. Merge these lists to get a single list divided 
into yfm buckets. The merging algorithm is explained in the appendix. 

T(n, yfm) = yfnT(yfn, yfm) + T merge (yfn, yfn, yfm) 
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Input (n keys) 




Figure 6: As shown for Bi, all buckets are divided into yfm buckets 

where T merge (y/n, yfn,yfm) represents time needed to merge yfn lists, each of size yfn and all 
divided into yfm buckets (explained in section B . 

2. Now we have yfm buckets n\,ni v . ■,n/^. Each bucket is again divided into yfm buckets. 

T(n, m) = T(n, yfm) + T(m, \fm) 

i=l 

We follow the same approach as done in step 1 above with some modifications. The following steps 
are done for every bucket : 

(a) Divide bucket into contiguous chunks of size yfn. All these chunks(except last) will be of size 
yfn and last chunk will have size < yfn. 

(b) All these chunks (except the last) are divided into yfm buckets recursively. 

(c) The last chunk is divided into yfm buckets directly with the help of binary search and sorting. 
We will explain this step in detail later. 

(d) We have divided all these chunks in yfm buckets. Using our merge procedure, merge all these 
chunks to get one list divided into yfm buckets. This concludes step 2. 

Figure [4] gives a high level structure of this scheme. Finally, we obtain the following recurrence for the 
parallel time. 

T (n, m) = 0(n/p + logp) + 2T"(yfn, yfm). 

5.0.4 Detailed analysis 

Lemma 5.1 Using one processor, n keys can be divided into x buckets in 0(n log n) time with 0(% log^/ n) 
cache misses, given n > M and x < n . 

Proof: This problem can be easily reduced to sorting, so it will take the same time as that of sorting. 
We can sort n keys in time O(nlogn) with 0(-glog A /n) cache misses using one processor ([13]). After 
sorting, input is divided according to the bucket indexes. We just need to calculate the size of each bucket. 
Assuming bucket indexes are also given sorted. Start from the first bucket index and compare it with every 
key from the input and whenever it becomes larger than this bucket index we switch to the next bucket 
index and start comparing it with next key from input. It will read both the input and the bucket indexes 
contiguously from start. For x < n, the total cache misses for this step are bounded by 0(n/B). □ 
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Lemma 5.2 Using p cores, n keys can be divided into x buckets in + logp ) time with 0(n 2 /B) 

cache misses, given n 2 > Mp and x < n. 



Proof: We will solve this problem using the same approach as used in Lemma 5.1 



Sort n keys in 0[ ^- + logp ) time with total 0(n 2 /B + n) cache misses, given n 2 > Bp (Corollary 



2.1). 



The number of cache misses are bounded by 0(n 2 /B) if n > B (follows from n > \AM and M > B 2 ). 
After sorting, input keys has been divided according to buckets. To calculate the size of each bucket we 
assign every bucket index to one processor; say processor pi is assigned i th bucket index, pi does binary 
search to find the size of its bucket on n keys. 

To do this, p processors will take 0( xlo ^ n ) time and 0{xn/B) cache misses. Given x < n, the time and 
cache misses can be bounded by 0( 1 y) and 0(n 2 /B) respectively. □ 

(3/2 \ 
^ hlogpj time with total 

0(^ — |- y^fn) cache misses, given n > (Mp) 2 / 3 and y < ^fn. 

Proof: Let T(a, b) represent the total time to divide a keys into b buckets. The whole algorithm can be 
described by the following recurrence relation. 



T(n, y) = y/nT(y/n, y) + T merge (y/n, y/n, y) 



where T merge (y/n, y/n,y) represents time needed to merge y/n lists, each of size ^fn and divided into y 
buckets (explained in section [4]) . 



1. Divide n keys into ^/n contiguous chunks, each of size y/n. All of these chunks are divided into ^Jy 
buc kets in parallel. Every T(^/n,y) call is assigned to p/y/n processors and is solved using Lemma 



It will take o( 1 ^ 2 - + logp j time and 0(n/B) cache misses if n > M(pj ^fn) or n > (Mp) 2 / 3 
and y < y/n. Total cache misses for y/n calls will be 0(n 3 / 2 /B) and the time is o(^^- + logpj . 



2. From Step 1, we get y/n lists, each divided into y buckets. Merge these yfn lists to get a single list 
divided into y buckets. Only p/y/n processors are used in this step to avoid block misses. 

/ 3/2 \ 

Using p/\/n processors, merging will take 0\ ^— — h logp ) time with 0(n/B + yyn) cache misses, 



given n > (Mp/^Jn) or n > (Mp) 2 / 3 and y < y/n (Corollary 4.1). 



□ 



Corollary 5.1 Using (p/n 1 / 4 ) cores, n 1 / 2 keys can be divided into y buckets in \ ^ + logp ) time with 
total O^/^/B + y\/n) cache misses, given n > Mp and y < y/n. 



Proof: This is an instance of Lemma 5.3 with \fn input andp/n 1 / 4 cores which lead to total 0(n 3 / 4 /B + 



yifn) cache misses. To satisfy conditions given in Lemma 5.3, we need \fn > (Mp/n 1 / 4 ) 2 / 3 or n > Mp 



and y < ^fn. □ 



Theorem 5.1 We are given an array A containing n input keys and an array S containing z bucket indexes 
(stored in contiguous locations) . We can partition A into z buckets B\ , Bi , ■ ■ ■ , B z such that for 
i = 1,2,. . . z, we have max {x \ x £ Bi} < Si < min {x \ x E Bi + \}, using p processors in time 0(p logn + 
log n log log n) and a total of 0(^log M n) cache misses, given z < ^fn and n >max{Mp, _B 2 /( 1_1 °s™ . 



17 



Proof: Let T(a,b) represent the total time, T"(a,b) represent the parallel time and Q(a,b) represent the 
total cache misses to divide a keys into b buckets. 

All the processors work together to divide the problem into smaller size but when no. of sub-problems 
becomes more than the no. of processors then every core can have one problem and solve it independently 
using Lemma 5.1 During recursion, all the sub-problems at the same level have the same size. Every 
sub-problem is assigned processors according to its size so that (n/p) ratio is same for every sub-problem 
at any level. When problem size becomes lower than this ratio then it will mean that we have enough 
number of sub-problems to solve them on one processor individually. 

Let initial input size be represented by ./V and the total number of processors by P. It is assumed that 
N > MP. The whole task can be divided into these steps. 

1. If problem size < (N/P) then solve this problem on one processor otherwise break this problem into 
smaller sub-problems using the following steps. 

Using Lemma 5.1, n keys can be divided into z buckets using one processor in time 0(n log n) with 
0(-g log M n) cache misses, given z < n. 

2. Choose \Tz , 2^/z th , 3y/z th , . . . , \fz,\J~z h element from the given z splitters. We obtain a total of \fz 
elements. First we divide input keys into these buckets and each bucket is divided further into \fz 
buckets. For this step, it will take maximum (\z 1 / 2 /p~\) parallel steps and one cache miss for every 
chosen key. Total cache misses will be z 1 ^ 2 and can be bounded by n/B as z < \fn and n > B 2 . 



3. Write these f^ 1 / 2 ] elements at contiguous locations using \pz l l 2 jn\ cores in \n/p~\ time, \n/B~\ cache 
misses and zero block misses, if z 1 / 2 > Bpz 1 ! 2 jn or n > Bp. 

4. Divide the n keys into these \fz buckets as given as follows : 

T"(n, yfz) = T(s/n, yfz) + T^ erge (y/n, \/n, \fz) 

Q(n, \fz) = y/nQ(y/n, yfz) + Qmerge(Vn, \/», V^) 



(a) Divide whole input array into y/n contiguous sub-arrays of size y/n. Recursively divide each 
sub- array of size \fn into \fz buckets. 



(b) We get y/n lists, all divided into y/z buckets. Using Corollary 4.1 merge these lists into a single 
list divided into y/z buckets. Given, n > Mp and z < n. 

we get T"(n, J~z) = T"(y/n, sfz) + + logp^ (4) 

Similarly for cache misses Q(n, ^fz) = \fnQ(^/n, \fz) + 0(n/B + \fnz) 



5. The entire input has been divided into \fz buckets and we need to divide each bucket again into \fz 
buckets. 



1/2 



T{n,z) =T(n,^) + ^T(n i , v ^) (5) 



i=i 



Break this T(nj, \fz) in chunks of T(y / n, \J~z) and then merge these chunks using Corollary 
can be broken as n. L = Xi ^fn + y; L where yi < ^Jn. Xi < \fn and yi < n 1 / 2 



4.1 



T(ni, ■s/z) = xiT(^/n, y/z) + T merge (xi, y/n, y/z) + T(y/n, \fz) 
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Summing this for all z 1 / 2 problems, we get 

z x/i z l/2 

T(ni, sfz) = y/nT(y/n, yfz) + T merge (xi, y/n, yfz) + z 1/2 T(n 1/2 , yfz) (6) 

i=l i=l 

We have total p processors and z 1 / 2 calls to T(n 1//2 , V^)- Assign p/n 1 / 4 processors to each call and 



solve it using Corollary 5.1 As z < y/n so we have enough processors for z 1 / 2 calls. Each such 
call will take + log p \ time and 0(n 3 ^/B + yfzyfn) cache misses if i/i < y/n or z < y^n and 

n > Mjj. For all n 1 / 4 calls, total cache misses will be 0(n/B + y/nz). 
Combining equation Q,([5]) and Q, we get 



T"(n, z) = 2T"(y/n, v^) + + logpj + T^ erge (xi, y/n, y/z) 

Similarly for cache misses. 



1/2 



8=1 



Using Corollary 4.1 



T merge(Xi, y/n~, y/z)= O I 2 + logp 



For cache misses, Qmerge( x i) sf^i 

^/i)=0(^ + SiVS) (Corollary 



4.1) 



2 l/2 2 l/2 

which gives ^ Merge Q (xi, y/n, pi) => ^ ^ a^v 7 ^) 

i=i i=i 

2 l/2 

■y/n 



i=l 

Finally equation ([7]) can be rewritten as 

T"{n,z) = 2T"(V^, v^ + of^+logp^ 

Likewise for cache misses, using n > Mp, equation Q can be reduced to 

Q(n, y/n) = 2 v / nQ( v / n, v 7 ^) + 0(n/£ + v^) 



(7) 



Q(n, z) = 2y/nQ(y/n, y/z) + 0(n/B + v / nje) + Qmerge{xi, y/n, y/z) (8) 



Assign pj processors to every bucket rij such that pi = prii/n or to be precise j>j 



h log pi J (9) 

and Xiy/n > Mpi if n > Mp and 2 < n. As < p, so equation ([9]) can be reduced to 



The number of cache misses 0(n/B + \/^) can De bounded by 0(n/B) if -y/raz < n/B or n > zi? 2 . 
Say z = n 1 /* =>■ t = log z n. Condition n > Z-B 2 can be further reduced to n 1-1 /' > i? 2 =4> n > 

_gHvt n > £2/(l-log n *) 

Finally, to complete all the steps of algorithm in claimed bounds, we need n > max{Mp, B q }, where 
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g = 2/(1 — log n z). As mentioned earlier, ratio (n/p) remains same throughout the recursion proce- 
dure and the same is true for q also. Since after every iteration both n and z are reduced to \fn, \J~z~ 
and log n z remain same, which means q will also remain same through the recursion. 
If it is given that N > max{MP, B q P}, condition given above (n > max{Mp, B q }) is true till prob- 
lem size > (N/P) and when problem size becomes N/P, it means we have enough problems such 
that every processor can be assigned one problem. 

Processor Allocation : During every iteration of this algorithm, one problem of size n is divided 
into yjn sub-problems. Processors are assigned to every sub-problem such that (n/p) ratio is same for 
every sub-problem. The i th sub-problem with size rij gets pi processors such that pi = pni/n, where 

n 1 / 2 

rii = n. After doing prefix sum computation on sizes of all n 1 / 2 sub-problems, every processor 

i=i 

can be assigned to its respective sub-problem. It will take O + logp^ time using p/ y/n cores with 
total cache misses 0(y/n/B), given n > Mp. 



The whole algorithm can be summarized as : 

if N > max{MP, B q P}, where q = 2/(1 - log n z). 



r^iH ^' 08 ^ 2 ^^ ,( "- N/P do) 

log ^ ) otherwise 



The total number of cache misses for all p processors 



, 0(n/B) + 2VnQ(Vn,Vz) Hn>N/P 
Q(n, z) = \ N N (11) 

0(^p log M ^ ) otherwise 



Assume ^ = K, where K will be constant throughout the algorithm as explained before 

T"(n,z) 



a(K + logn) + 2T // (Vn, v / ^) if n > if 
bK log K otherwise 



We get, T"(n,z) = a(K + logn) + a(2K + log n)+ . . .+ a(2 x K + logn) + b(2 x+1 KlogK) 
=► T"(n,z) = 2 X+1 aK + ax log n + 2 X+1 bK log K 

Using („)V > K 2 X < 

We get, T"(n, z) = 0(K^ + log logn + K logn), which comes to 

T"(n, z) = 0(K logn + lognloglogn) Or 0(^ logn + log n log log n) . 
The total cache misses for all the p processors 



an/B + 2^fnQ(y/n, s/z) if n > K 
bjj log M K. otherwise 



QM = \,k, I . ( 12 ) 



Q(n, z) = a(§) + 2a(l) + 4a(§) + . . . + 2 x a(%) + 2 x +^a(^ log M K) 
=> Q(n, z) = 2 x+1 a(%) + 2 x+l a(% log M K) 
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Using (n)^ x > K ==>■ 2 X < 

We get, Q(n, z) = 0(§ log^ n + f log M n) = 0(§ log M n) if if > M or n > Mp. 

□ 



5.1 Multi-search 

Lemma 5.4 VFe can simultaneously search for n elements in a sorted set of m elements using p pro- 
cessors in logn + log n log log n) time with total 0(^log M n) cache misses, given m < y/n and 
n >max{Mp,B 2 l^-^ m )p). 



Proof: This follows directly from Theorem 5.1 The basic approach goes back to Reif and Sen |18j . 
The condition n > i? 2 /( 1_1 °Sn m )p can De simplified ton > B A p when m = \fn. As the value of m decreases, 
this condition becomes progressively weaker. For m = n 1 / 32 , condition is reduced to n > i?( 2 + 2 / 31 )p. □ 



6 Convex Hull construction 

An object is convex if for every pair of points within the object, the straight line segment that joins them 
lies completely within the object. 

Given a set of points P, the convex hull CH(P) is the smallest convex object which contains P. 
Because of close relationship between convex hull and sorting, we will try to prove the same time bound and 
cache cost for convex hull problem as our sorting algorithm. Our algorithm is based on the algorithm given 
in Reif and Sen [17J and our description follows the randomized divide- and-conquer strategy described 
in pjj]. It has been known that the convex hull problem can be reduced to the problem of finding the 
intersection of half-planes (under a dual transform) . So we will solve the problem of finding the intersection 
of half-planes as it is relatively easier to divide it into disjoint sub-problems. 

7 Basic Algorithms 

Before we describe the main algorithm, we briefly describe some of the basic subroutines that are used 
frequently. 

7.1 2-D Maxima Problem 

A point p = (p x ,Py) dominates a point q = (q x ,Qy) if and only if p x > q x and p y > q y . The maxima 
problem can be defined as : Given a set of points, to report the maximal points within the set i.e. to report 
all the points which are not dominated by any other point in this set. 
This problem can be divided into following steps : 

1. Sort the given points along x-axis to get a set T = {t\, t2, • ■ • , t n }. 

2. It is obvious that t n is a maximal point. Start sweeping from t n towards left (in decreasing value of 
x). The first point we encounter is p n -i- We already know that t n -\{x) < t n (x), so if i ra -i(2/) < t n (y) 
then t n -i is dominated by t n . Similarly, for the next point in left direction i n _2, we just need to 
compare t n -2(y) with max{i n (y), t n -i(y)} to verify whether point t n -2 is maximal point or not. 

In short, to check any point ij whether it is maximal or not, we just need to compare U(y) with 
max{tj + i(y), tj + 2(y), . . . , t n (y)}- While sweeping to left, maintaining the value of the largest y- 
coordinate till that point is enough for comparison. 
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lb) 



Figure 7: (a) Input points (b) Maximal points form the stairs 

Lemma 7.1 From a given set S of n points, maximal points can be found in 0(n log n) time with total 
0(§ log M n) cache misses, using one processor, given n > M. 

Proof: The whole task can be divided into following two steps : 

1. Sort given n keys w.r.t. x-axis in O(nlogn) time with total 0(^log M n) cache misses using one 
processor, given n > M. [|16j]. 

2. As explained above, this sorted set is read sequentially which will cost 0{n/B) cache misses. We 
need to maintain one max variable which will be accessed every time next element is read from input, 
because of LRU replacement policy, it will not cost any extra cache miss. 

□ 



7.2 Finding maximal points on p cores 

Lemma 7.2 From a given set Sofn points, maximal points can be found in expected log n+log n log log n 
time with expected 0(^log M n) cache misses using p cores, given n > Mp and M > B 2 B 2 / 31 . 

Proof: The whole task can be divided into following steps : 

1. Sort input points w.r.t. x co-ordinate in expected time logn + log n log log n) with expected 



0(jj logjy// n) cache misses using p cores, given n > Mp and M > B 2 B 2 ^ 31 (Theorem 3.1 ) 



2. Assume input has been divided into p contiguous chunks fei, &2, . . . , k p , each of size n/p. Each core 
is assigned one chunk to find maximum element say y, from it. To do so, every core will read these 
n/p keys sequentially. This will take n/p time and total n/B +p cache misses across all cores. We 
get total p elements. 

3. Write these p elements at p contiguous locations using p 2 jn cores in n/p time with total p cache 
misses, if p > Bp 2 /n or n > Bp (Lemma 2.6). Total cache misses will be 0(n/B) if n > Bp. 



4. Do prefix computation using max operator on these p points found is Step 3. To avoid block misses 
use only p 2 /n processors. Using p 2 /n cores, prefix computation on p keys can be done in 0(n/p+logp) 



time with total 0(p/B) cache misses given p > Mp /n or n > Mp (Lemma 2.3). 



5. Every core is assigned a task to find maximal points from one chunk. It will take 0(- log -) time 



and total 0(j| logjv/ n) cache misses across all p cores, if n/p > M. (Lemma 7.1 ). 
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6. Every core have one output list of maximal points from its corresponding chunk. Store these lists at 
contiguous locations. Using qp/n cores, we can join these p lists of total size q, (where q < n) in time 



0(n/p + logp) time and 0(q/B) cache misses, if q > Mqp/n or n > Mp (Theorem 4.2) 



□ 



7.3 Dividing Convex hull problem into Upper hull & Lower hull 

Lemma 7.3 Given a problem to calculate convex hull of n points, it can be divided into two disjoints 
problem of finding Upper hull and Lower hull of size m and n — m respectively in 0{n/p + logp) time with 
total 0{n/B) cache misses using p cores, given n > Mp. 

Proof: Divide the given problem into following simple steps : 



1. Find the point with maximum x co-ordinate and minimum x co-ordinate [Lemma 2.1 . It will take 
0(n/p + logp) time and total 0{n/B) cache misses, if n > Bp. 

2. We get two points p\ and p2- For every input point check whether it is on upper side of line P1P2 
or lower side of line p\P2- Based on this information, input points can be divided into two disjoint 
problem. 

3. Every core reads n/p keys and divide them into two buckets. This will take \n/p~\ time and n/B + p 
cache misses. We will get p lists each of each of n/p size, all divided into two problems. 



4. Using lemma 4.1, merge these lists into a single list divided into two buckets in 0{n/p + logp) time 
and total 0(n/B + 2p) cache misses, given n > Mp and n/p > 2 or n > 2p. Total cache misses are 
bounded by 0(n/B) if n > Mp. 

Whole algorithm can be summarized by this recurrence relation. 

T(n, 2) = pT(n/p, 2) + T merge (p, n/p, 2) 

□ 

7.4 Brute-force algorithm for intersection of half-planes 

Corollary 7.1 Given two points p\, p2 and n half-planes, find number of half-planes for which p\ and p2 
are on same side in Ol ^ + logp I time and total 0(n/B) cache misses, given n > Bp. 



Proof: This is simple instance of Lemma 2.5 where rank of point can be defined as the number of half- 



planes for which both given points are on the same side. □ 

Lemma 7.4 Intersection of n half-planes can be computed in 0(n 3 /p + logp) time using p cores with total 
0(n 3 /-E>) cache misses, given n 3 > Mp and a point x which will be present in this intersection. 

Proof: We will use a simple brute-force approach. We have n half-planes ki,k2 r . .,k n and a point x for 
which we already know that it will be inside intersection. As n 3 > M and M > B 2 , so we can say that 
n 2 > B. 
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As n lines can have maximum n 2 intersection points. We will simply check for each point whether 
it is a vertex of convex hull or not. A point p is a vertex of convex hull if p and x both are on same 
side of every half-plane For every point ki , where 1 < i < n 2 , find whether it is vertex of convex hull 

or not, in parallel using p/n 2 cores. Using lemma 



7.1 



each point will take O ( — + log p ) time with 



total 0{n/B) cache misses, given n > Bp/n or n 3 > Bp. Total cache misses for n points will be 
0(n 3 /B). 

If point ki is vertex of convex hull, then one of the processors which are working on k 1 ^ point will 
write this point in output. Maximum n cores are ready to write one point each. We need to write 
these n points in n contiguous locations. Create an output array A of length n 3 and key ki will be 
written at position n 2 ki in A. This will take 0(n/p) time, n cache misses and zero block misses if 
n 2 > B. n cache misses can be bounded by 0(n 3 /B) if n 2 > B. We have written these n elements 
in n 3 locations. 

Write these n points at n contiguous locations, in time n 3 /p with total n cache misses, using p/n 2 



cores if n > Bp/n or n > Bp (Lemma 2.6). Total cache misses are bounded by 0{n 4 /B) if n 2 > B. 



All these written points are vertices's of convex hull but they may not be ordered. Divide these 
n points into two disjoint problems Upper hull and Lower hull using left-most and rightmost point 



(Lemma 7.3) with p/n cores in time 0(n /p + logp) with total 0(n/B) cache misses, given n > 



Mp/n 2 or n 3 > Mp. 



5. Using lemma 2.7 sort both of these disjoint problems. We can sort n points along x-axis using n 2 /p 
cores in 0(n 2 /p + logp) time and 0(n 2 /B + n) cache misses which can be bounded by 0(n 3 /B) if 
n 2 > B. 

This concludes the lemma. □ 



8 Randomized Algorithm For Convex Hull construction 

The main result in [T7] states that Given n half-planes in R 3 , and a point t inside their intersection, we 
can find intersection of these half-planes in O(logn) expected time using n CREW PRAM processors. This 
algorithm achieves the optimal speed up. The basic steps of this algorithm are : 

1. Select O(logn) samples each of size [n e \ from K randomly, where < e < 1. 

2. Select a good sample using Polling. 

3. Find intersection of half-planes in selected sample using any brute force method to get a convex hull 
with points Pi, P2, . . . , P x (where x < e.) 

4. This convex polygon can be divided in x triangles (or sectors) of the form P\tP%, P2tP$, . . . , P x -\tP x , P x tP\, 
where t is given point which is present inside this hull. 

5. The original problem can be divided into x disjoint sub-problems, one sub-problem for each sector. 
For every sector find the intersection of half-planes which intersect this sector. 

6. As one half-plane can intersect many sectors so total size of all sub-problems may be much larger 
than size of original problem. But because of Polling we can claim that total size of all sub-problems 
will be 0(n), and size of maximum sub-problem will be < n 1_e logn. 

7. Use a filtering scheme to reduce total size of all sub-problems from 0{n) to In. 

8. If size of sub-problem is very small then solve it directly using any brute- force algorithm otherwise 
solve it recursively using this same procedure. 



24 



8.1 Adaptation to Multi-Core Model 

We will use the algorithm explained above with slight modifications on Multi-core cache oblivious model 
to achieve the same kind of bounds as the PRAM model. 

Theorem 8.1 Using p cores, the intersection ofn half-planes can be computed in expected time 0(— logn+ 
logn log logn) incurring an expected 0{^\og M n) cache misses, given n > Mp and M > B 2 B 2 / 7 . 

Proof: The whole algorithm can be divided into following steps: 

1. From n half-planes, using polling^ choose a good sample of size n e using p cores in expected 
time 0(p logn + log n log log ra) with expected 0(^log M n) cache misses, given e < 1/8, n > 



max{Mp, B 2 /^-^p} and M > B 2 B 2 / 31 (Theorem 8.2). 



2. Compute intersection of n e half-planes in 0(n/p + log p) time using n 3e p/n cores incurring a total of 
0(n 3e /B) cache misses, given n 3<E > Mn 3e p/n or n > Mp and a point O which will be present in this 



intersection. (Lemma 7.4). No. of processors needed for this are < p and total cache misses can be 



bounded by 0(n/B) if e < 1/3. 

3. We get a convex hull with points Pi, P%, . . . , P x (where x < n e .) Divide this convex polygon into x 
triangles (or sectors) of the form P1OP2, P2OP3, . . . , P x -\OP x , P x OP\, where O is some given point 
which is present inside this hull. 

4. Divide the original problem into x = 0(n e ) disjoint sub-problems. 

Using p processors, divide n half-planes into n e disjoint sub-problems in expected 0(^logn + 
log n log log n) time with expected Q (j| \ ogM n) cache misses , if n > n t<yX+2 \ x > 2, M > B 2 B 2 ^ 1 



and n >max{Afp, B 2+A / X p} Lemma (8.3). 
Choose x = (i — 2). Condition x > 2 is reduced to e < 1/8 and n > B 2+i l x p is reduced to 
n > B 2 /(i-4«)p. " 

5. Because of the way sampling was done in Step 1 (using polling), we can claim that total size of all 
sub-problems will be 0(n) [XT] . Use a filtering scheme to reduce total size of all sub-problems from 
0(n) to 2n. 

Using n e sectors, 0(n) half-planes can be filtered in expected time 0(^ logn + log n log log n) with a 



total of 0(t| log^n) expected cache misses, given n > Mp and M > B 2 B 2 / 31 (Theorem^). 



6. After filtering, total size of all sub-problems is < 2n and all these sub-problems has been divided into 
e sectors. To maintain the same n/p ratio throughout the recursion, increase the number of processors 
to 2p. As it has been proved (|19|) that at any level of recursion, maximum size of total sub-problems 
will be < 2n, 2p processors will be enough for this algorithm. No. of processors are assigned to 
each sub-problem according to its size. For e.g. problem with size x will get xp/n processors. For 
division of processors, prefix computation is done on n e sizes. Using n e p/n processors, it w ill take 

0\ ~ + logp ) time and 0(n e /B) cache misses, given n e > Mn e p/n or n > Mp (Lemma 



2.3) 



Maximum size of any sub-problem can be 2n( 1_<E ) logn. Those sub-problems which have size < n/p 
get only one processor and can be solved sequentially. We do not need to divide these sub-problems 
any further. Using one processor, intersection of b half-planes can be computed in 0{b\ogb) time 
and O(-glogj^fe) cache misses [lj. All other sub-problems are solved recursively using this same 
procedure. 



3 An efficient re-sampling technique that is described later 
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As mentioned above, to complete step 1-7 in claimed bounds we need e < 1/8, n > max{Mp, B 2 ^ 1 ie ^p} 
and M > B 2 B 2 / 31 . By choosing e = 1/32, all these can be reduced to n > Mp and M > B 2 B 2 /" 1 ' . 
This algorithm follows same kind of recurrence as showed in our sorting algorithm (Theorem 3.1). 



T"(n) 



0(K log n + log n log log n) + T"{rii) if n > K 
0(K log K ) otherwise 



(13) 



where K = n/p and rij < 2n 1_e logn or rij < 2n 31//32 logn or rij < n 63 / 64 if 2 logn < n 1 / 64 which will be 
true for large n. 

Likewise total cache misses for all p processors 



Q(n) 



,1/x 



0(§ log M n) + ^Q(n i ) 



i=l 



0(§ \og M K) 



if n > if 
otherwise 



(14) 



As solved in Theorem 
expected cache misses, 



3.1 



we get total expected time logn + log n log log n) and total 0(^log M n) 

□ 



Corollary 8.1 Given n half-spaces in B? that contains the origin, we can find their intersection using p 
cores in expected time 0{~ logn + logn log logn) with expected cache misses 0(^log A/ n), given n > Mp 
and M > B 2 B 2 / 7 . 

Proof: The above algorithm readily generalizes to 3 dimensions using the algorithm of Reif and Sen|17j 
where we use a simpler filtering step described in Amato, Goodrich and Ramos [3]. □ 



8.2 Intersection of Half-planes and Convex Hull 

Given a convex polygon divided into sectors and number of input half-planes, for every half-plane identify 
the sectors of convex hull it intersects. As shown in [19J, this problem can be reduced to a point location 
problem. In dual space these half-planes are mapped to points and vertices's of convex hull are mapped 
to lines. These lines will intersect with each other to create some regions. It can be proved that all those 
points which belong to same region in dual space, in primal space they are the half-planes which intersect 
same set of sectors in this convex hull. As a part of preprocessing, for every region find the set of sectors it 
intersect. After preprocessing, for any half-plane, do point location on these lines in the dual space. (These 
lines are vertices's of convex hull in primal space.) After finding region we can list all the sectors which 
this half-plane intersects. For point location problem we will use the algorithm of Dobkin and Lipton [12J. 
After preprocessing the half-planes, we can perform the following steps: 

1. For any query point we can find the unique cell in which it is present using two binary searches. 

2. For every region we can list the set of sectors in constant time where listing involves the start sector 
and the end sector. Because of convexity of the polygon, we can say that half-plane which pass 
through these start and end sectors will also pass through the sectors between these start and end 
sector. 

Lemma 8.1 Using p cores, m half-planes can be preprocessed in 0(m 4 /p + logp) time with 0(m 4 /B) 
cache misses, given m 4 > Bp so that point location among the half-planes can be done using two binary 
searches. 

Proof: The algorithm executes in the following steps : 
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1. For m half-planes we can have a maximum of m 2 intersection points. Sort these m 2 points w.r.t. x 
co-ordinate. Using p cores, It will take ol + logp J time and total 0(m 4 /B + m) cache misses, 



given m > Bp (Theorem 2.7). 



From Step 1, we get m 2 intervals or say vertical slabs. In each slab there will be n non-intersecting 
half-planes. We will order these half-planes according to y coordinate. Allocate equal number of 
processors to each slab, so that every slab will get p/m 2 processors. 

For every vertical slab find intersection points of m half-planes with both of its sides. This task can 
be done easily in [~m 3 jp~\ time and \{m/B + p/m 2 )~\ cache misses by assigning m?/p half-planes to 
each processor for every slab. Total cache misses for all vertical slabs will be (m s /B + p) which is 
bounded by 0(m 4 /B) if to 4 > Bp. 

All given m half-planes has been divided into m 3 regions and as said earlier, all points which belongs 
to same region will intersect same set of sectors. We just need to find the set of sectors, any region 
will intersect. Take one point from every region, map it to half-plane in primal space and test it 
against all sectors using some brute force algorithm to get a set of sectors for every region but as this 
polygon is convex we will only save start and end sector. 

This is done for all m 3 regions. Assign m 3 /p of these regions to every processor. Every processor 
will search linearly through all sectors to find the set of sectors for every half-plane. It will take a 
total of \m lp~\ time and — x | xp = [~m 4 /-E>] cache misses across all p cores. 

□ 



Lemma 8.2 Given n points and arrangement of m half-planes, for every point we can identify the re- 
gion it will lie in, in 0(- logn + log n log log n) time and 0{^\og M n) cache misses using p cores, given 
n >max{Mp, B 2+i l x p} ,n > m x+2 and x > 2. 

Proof: The entire procedure can be divided into the following steps : 



1. Given m 4 > Bp, pre-process m half-plane using Lemma 8.1 in 0(m 4 /p + logp) time with total 
0(m 4 /.B) cache misses. If n > m , time and total cache misses are bounded by Oinjp + logp) and 
0(n/B) respectively. After pre-processing, m planes have been divided into m 2 vertical slabs and 
every vertical slab is further divided into m horizontal slabs. Using this information, we will locate 
the region for every input point. 

2. Divide all the input points into these vertical slabs. 

Using Theorem 5.1, n points can be divided into m? vertical slabs in 0(~ log n + log n log log n) time 
and total 0(]| logjv/ n ) cache misses using p cores, given m 4 < n and n >max{Mp, £> 2 /( 1_21 °g™ m ^p}. 

3. All n points have been divided into m 2 sub-problems (vertical slabs) where the i th sub-problem 
contains s» points. Divide every sub-problem again into m sub-problems (horizontal slabs) using 
following steps : 

(a) First consider sub-problems with Si > m x . This x will be chosen after analysis. In worst case, 
all the sub-problems may have size > m x . Assign s-ip/n processors to i th problem according to 
size of this problem. 

(b) Processor allocation is done by doing prefix computation on size of these m 2 sizes of sub- 
problems. Using m 2 p/n cores, prefix computation on m 2 keys can be done in 0(n/p + logp) 



time with total 0(m 2 /B) cache misses, given m 2 > Mm 2 p/n or n > Mp (Theorem 2.3). We 
will have enough processors for this task if m 2 < n. 
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(c) Divide Sj points into m slabs in 0(Mogn + log n log log n) time and total 0(^log M n) cache 



misses using Sip/n cores, given Sj > m 2 and - s ^ n > max{ M, i? 2 ^ 1 logs i m )} or 
n/p ^maxlAf,^ 2 ^ 1-10 ^^} (Theorem |o~l). 

In the worst case this step is done for all m 2 problems. The total number of cache misses are 

2 

in 

/ ~k 1°§m 71 = ~b ^°Sm n f° r an m2 problems. The condition Si > m 2 is true if x > 2 because 
i=l 

we are doing this step only for those problem who have s, > m x . For the next condition, we can 
see that B 2 ^ 1 ~ loga i m ^ decreases as we increase Sj, so if condition n/p > £? 2 ' ( 1_logs i m ' is true for 

minimum then it will be true for all Sj. The minimum Sj is to 1 ', which reduce the condition 

to n/p > ^/(l-log^^m) Qr ^ y B 2x/(x-i) } where x > 2. 

(d) Solve rest of the sub-problems with sizes < m x . In the worst case there will be m 2 such problems 
and all of them will have size m x . 



Using Theorem 5.1 we can divide m x (x is a small constant) points into m slabs in 0(jj* log to + 
logmloglogm) time and total 0(^-log M m) cache misses using m x p/n cores, given m x > m 2 
and > B 2l{l ^°^ x ) m) or n/p >max{M, fl^A*- 1 )}. We win need a total of m *+2 p / n 

processors for m 2 problems. So we have enough processors for this task if n > m x+2 . The total 
cache misses for all m 2 problems will be 0( m B log M m) which can be bounded by 0(|| log^/ n) 
if n > m x+2 . 

After combining all of the conditions mentioned in steps above, we get n > m 4 , x > 2, n > m I+2 , n/p > 
maxjM,^/^- 1 )} and n/p > max{M, 5 2 /( 1 " 21og ™ m )}. As x > 2 so n > to x+2 implies that n > to 4 . 
As i? 2 /( 1_los « m ) decreases as we increase n, so if condition n/p > i3 2 /( 1_1 °Sn m ) is true for minimum n then 
it will be true for all n. Minimum n possible is m x+2 so n/p > B 2 ^ 1 21 ° s (m a =+ 2 ) m ) or > ^ g 
enough, where x > 2. As B 2+4 / x > B 2x ^ x ~^ so n/p > max{M, B 2+i / x } is enough. 

□ 



Corollary 8.2 Given n half-planes and a convex polygon divided into m radial sectors, for every half-plane 
h, all the sectors intersected by h can be identified in log n+logn log logn) time and 0(§ log^ n ) cache 

> m x+2 , x > 2 and n > max{Mp, _B 2+4 / x 'r 



misses using p cores, given n 



c p}. 



Proof: We get this result simply by combining Lemma 8.1 and Lemma[8.2| Map n half-planes to n points 



in dual space and m vertices's of convex polygon to m half-planes. 



Using Lemma 8.2 .given n points and to half-planes, for every point we can identify the region it lies in, in 
logn + log n log log n) time and 0(§ \og M n) cache misses using p cores, given n > m x+2 , x > 2 and 
n >max{Mp, B 2+4 / x p}. 



Because of preprocessing done on m half-planes in Lemma 8.2, any region can be directly mapped to its 
set of sectors. □ 



Lemma 8.3 Given n half planes, and a convex polygon divided between m radial sectors, for each sector 
Ci where i < m, we want to identify all the half-planes intersecting C{. If the cumulative output size is 
0(n), this task can be done using p processors in expected time logn + logn log logn) with expected 
0(^log M n) cache misses, given n > m x+2 , x > 2, M > B 2 B 2 ^ 31 and n >max{Mp,B 2+A l x p}. 

Proof: We can divide this task into following steps : 



1. Using Lemma 8.2 for every half-plane identify all the sectors this half-plane will intersect in log n+ 
logn log logn) time and 0(^log M n) cache misses using p cores, given n > m x+2 , x > 2 and n > 
m&x{Mp,B 2+i / x p}. 
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For every half-plane Ki (where 1 < i < n) if it intersect with sectors Sj,Sj + i,Sj + 2,- ■ ■ ,Sfc (where 
j < k and both j and k are < m), it means there will be U = (k — j + 1) copies of this half-plane 
in output list. So we need to write t{ copies of Ki and assign them number j,j + l,j + 2,...,k 

z 

respectively. There will be total t, t elements in output which is bounded by an (for some constant 

i=i 

as given). Sort this output list w.r.t. this assigned number. 
This step is explained in detail in following steps : 

(a) Assign n/p half-planes to every core so that every core will find ti for all of its n/p half-planes 
and add them. To add n/p elements, one processor will take 0(n/p) time and 0(n/pB) cache 
misses. The total cache misses across all cores will be n/B. 

(b) Every core will write its n/p half-planes in output list as explained above. To find the location 
where every core needs to write we will do prefix computation on these p sums found in step 
above. Using p 2 /n cores, prefix computation on p keys can be done in 0(n/p + logp) time with 



total 0(p/B) cache misses given p > Mp /n or n > Mp (Lemma 2.3 ) 



Sort these an elements using Theorem 3.1 w.r.t. number assigned to them above to get final 



output half-planes divided them into m sectors. It will take 0{ q j- log n + log n log log n) expected 



time and 0(^log M n) expected cache misses, given an > Mp and M > B 2 B 2 / 31 (for some 



B 

constant a). 



□ 



8.3 Sampling and polling 

To bound the total size of the sub-problems we need to find a good sample. A good sample means when 
we divide our n half-planes into disjoints sub-problems using this sample total size of sub-problems will be 
< cn and size of maximum sub-problem will be < 2n 1_e logn, where e is number of sub-problems. We will 
find this good sample using sampling and polling as shown in |19| . First we choose logn samples each of 
size n e and a random set of n/log 4 n input half-planes for all of these samples. Then for every sample we 
divide its input sample into disjoint problems using this sample. Then based on total problem sizes of all 
these samples, we choose a good sample from those logn samples [19] . 

Lemma 8.4 Using one processor, x elements can be sampled from n elements in 0(x log x + n) time with 
total 0(|| + H logx) cache misses, using x independent trials such that the probability of an element being 
chosen is same in each trial. 

Proof: We proceed as follows : 

1. Processor will generate x random numbers in range [1, n] and save them sequentially in an array of 
length x. It will take x time and incur x/B cache misses. 

2. Sort these x keys in time O(xlogx) with 0(|j logx) cache misses using one processor. 

3. Read these x numbers and input keys both from start and select those number whose rank is present 
in these x numbers. It will take total (x + n) time and (x + n)/B cache misses. 

□ 



Theorem 8.2 Using polling, a good sample of size n e can be chosen from n half-planes in expected 
0(p logn + logn log logn) time with expected 0(^log M n) cache misses, using p cores, given e < 1/8 
and n >max{Mp, S 2 /^" 4 ^} and M > B 2 B 2 / 31 . 
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Proof: We can divide the whole task into following steps : 

1. Using p cores, choose log n samples each of size n e from given n half-planes randomly, where < e < 1. 
For every sample, the probability for choosing any element is equal. Imagine that the input has been 
divided into n e contiguous chunks all of equal size and choose one key from every chunk. 

To choose one sample using pj log n cores, it will take [logn. n e jp\ parallel steps and one cache miss 
for every chosen key implying total n e cache misses. For all logn samples, total cache misses will 
be 0(logn n e ). The time is bounded by O(^logn) if e < 1 and the cache misses are bounded by 
< | log M n if n l ~ e > BlogM (which is true if n > B 2 ^ 1 -^). 

2. Find intersection of half-planes for every sample. 

Using pn Se /nlogn cores, intersection of n e half-planes can be computed in 0(^logn + logp) time 



with total 0(n /B) cache misses, if n > Mpn /nlogn or nlogn > Mp (Lemma 7.4) 

3e 



3e /B) cache misses, if n 3e > Mpn 3e / 
The total cache misses for logn samples will be O(^-logn) which is bounded by 0(]|log M n) if 
n 3e logM < n or e < 1/6. The total processors needed are pn Se /n which is < p if e < 1/3. 

We have to choose a random subset of n/ log 4 n planes for every sample. If we use the same method 
as used in Step 1, it will lead to total 0(n/ log 3 n) cache misses. 
To reduce cache misses we use the following strategy : 

(a) We have to choose total n/log 3 n elements. Every core is assigned n/p half-planes and gets a 
task to choose , n -> elements from these n/p elements. 

p log n ' * 



(b) Using Lemma 8.4, Every core will sample " 3 elements from n/p elements in Oinjp) time 

P log 71 



and 0(n/pB) cache misses. Total cache misses across all cores will be 0{n/B). 

Every core assigns a random number in the range 1 to logn to all of these - lo ^ 3 - half-planes. 



(d) Divide these n half-planes into logn problems w.r.t. this assigned number using Theorem 5.1 
Using p processors it will take 0(-logn + logn log logn) time and total 0{^\og M n) cache 
misses, given n > max{Mp, 5 2 /( 1_1 °Sn lo s n )p} an d l Q gn < -^/n. 



4. To find a good sample what we need is that every sample should divide its input sample into disjoint 
sub-problems and based on the total size of those sub-problems, we can pick a good sample from 
these. 

5. Using intersection found in step 1, every sample will divide its input sample (chosen in step 3) into 
disjoint problems. 

Given n/log 4 n half-planes (chosen in step 3) and a convex polygon divided into n e sectors (found 
in step 1), for every half-plane we can find all sectors this half-plane will intersect in 0(Mogn + 
lognloglogn) time and 0( B ^ 4 - log M n) cache misses using p/log 4 n cores, given lo ™i n > n e ( x+2 \ 
x > 2 and n > max{Mp, B 2+i / x p} (Corollary 



8.2 



We get an output list of size n/log n. To find total size of sub-problems we need to just add these 
nj log 4 n elements using p/ log 4 n processors. It takes time 0(n/p + log p) with a total of O( -^ht^) 
cache misses, given n > Bp. As both tasks are done for logn samples, so the total cache misses 
across all cores will be Q( gl( ^a w l°gM n ) which is < 0(n/B) 

6. For every sample we have found the total size of the sub-problems. After finding the maximum size 
we have enough information to select a good sample. Using plogn/n cores, maximum of logn keys 
can be found in time 0(n/p + logp) with total 0(logn/l?) cache misses given logn > Bplogn/n or 
n > Bp. 

All of the above conditions can be combine as n > n e ^ x+2 ^ log 4 n, x > 2 and n > max{Mp, B 2+ ^ x p} 
M > B 2 B 2 / 31 , e < 1/6, n > B 2 ^ 1 ^, n > max{Afp, B 2 /( 1 " lo Sn lo s n )p}. 
By choosing x = - 2), we obtain n > B 2 ^ 1 - 4 ^, e < 1/8, M > B 2 B 2 / 31 
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□ 



8.4 Filtering 

The total size of sub-problems after obtaining a good sample is 0(n). We do some further pre-processing to 
reduce this size to exact size of parent problem. This step is a kind of post-processing step after sampling. 
For every sector we identify half-planes which intersect it. Some of them may be part of the output while 




Case (a) 




Case [b] 



Figure 8: (a) B dominates A (b) No half-plane is dominated 

some of them may not show up in output (in that sector). Since the output size is bounded by input size 
so our objective is to eliminate all those half-planes from a sector which do not show up in output. This 
task can be reduced (T9j to sorting and maximal points problem. 

As shown in figure [8^a) we can say that half-plane A is dominated by B through-out the sector so it 
would not be part of output in this sector. In filtering step we will remove all half-planes of these kind. 
After filtering there would not be any half-plane left in any sector such that it is dominated by any other 
half-plane as shown in figure [8^b) . 

Filtering : We are given some sectors and some half-planes. From these half-planes we will remove most 
of those planes which would not be part of output (intersection of given half-planes) . If output size if x 
(intersection of given half-planes), then after this filtering we will have utmost 2x half-planes left. 

Theorem 8.3 Given a sector, n half-planes can be filtered in expected time 0(— log n + log n log log n) with 
a total of 0(]| logjvf n) expected cache misses , given n > Mp and M > B 2 B 2 ^ 1 . 

Proof: This task can be divided into the following steps : 

1. Consider one side of a sector and find intersection points of half-planes with this side. This can be 
done in \n/p~\ time and \n/B~\ cache misses by assigning n/p half-planes to each processor. 



2. Sort these half-planes w.r.t. distance of intersection point found above from center of sector. Using 
Theorem 

misses, given n > Mp and M > B 2 B 2 / 31 . 



it will take expected time 0(^ logn + log n log log n) and expected 0(jy logj^f n) 



cache 
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3. Repeat steps 1 and 2 for other side of sector. 

4. We get two sorted lists say x and y. For half-plane rn consider a tuple of the form Xi,yi, where 
(xi,yi) are rank of these half-planes in these sorted lists. To find this tuple we can do the following 
- In list x write rank on every half-plane and then sort this list w.r.t. half-plane number. We get a 
list of ranks for all half-planes. 

5. We have a tuple (xi,yi) for every half-plane. A half-plane n« is completely occluded by another 
half-plane rij if Xj > x\ and yj > y%. In that case half-plane rtj would not be part of output. Find 
maximal points on this tuple. 



6. Using Theorem 7.2, given n points, we can find the maximal points in expected time 0(^logn + 



log n log log n) with a total of 0(§ log M n) expected cache misses using p cores, given n > Mp and 
M > B 2 B 2 / 31 . 



□ 



Theorem 8.4 Given n e sectors, we can filter 0(n) half-planes in expected time 0(- log n + log n log log n) 
with a total of 0(^log AI n) expected cache misses , given n > Mp and M > B 2 B 2 / 31 . 

Proof: We have a total of an half-planes divided into n e problems. Each problem has size Si, (where 

1 < i < n e ) and Sj = an. We will do filtering on all of these n € problems in parallel. Every problem is 

i=i 

assigned number of processors according to its size, so the i th problem is assigned s^p/an processors using 
prefix computation (Theorem 2.3). 



Using Theorem |8.3| described above, one problem of size Si (where Sj < an) using Sip/an processors can be 
filtered in expected time 0( g ^ logn + log n log log n) with a total of 0(^log M n) expected cache misses , 

given Si > Msip/an or an > Mp and M > B 2 B 2 / 31 . The total cache misses for all the n e problems is 
0(flog M n). 

For any constant a, time is bounded by 0(^logn + logn log logn) and total number of cache misses is 
bounded by 0(§ log M n). □ 



9 Concurrent Writes and processor oblivious load balancing 

Traditionally, parallel programs are written assuming that there is a unique id (an integer in the range 
1 . . .p) for each of the p processors. The processors id is used to designate a particular task to a specific 
processor. For example, for an input array of n numbers, a processor with id i may be allocated the task 
associated with the sub-array — ■ i . . . — ■ (i+ 1). This is easy because p is known at the time the parallel code 
is generated. However, in some situations p may not known and moreover, the processors may not have an 
id associated with them. We illustrate this method for the fundamental problem of prefix computation. 
The basic idea is that each of the p processors simultaneously chooses a random number in the range [l..n] 
and writes to the corresponding location in an array A. The expected number of processors writing to a 
specific location A[i] 1 < i < n, is p/n and no more than O(logn) w.h.p. - note that, from our earlier 
assumptions, p < n/B, so the expected number of elements writing into a B element block < 1. Roughly 
speaking, a processor writing to a location i assumes responsibility for the block of n/p elements starting 
from L^T^J- However, because of conflicts caused by independent random choices, we have to do some 
limited redistribution and also estimate the value of p. It can be argued using Chernoff bounds that w.h.p. 
0(logn) processors will choose a location in the range [bi . . . b(i + 1)] i = 0, 1 . . . where b = (n/p) ■ logn. 
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From the previous observation, p can be estimated in the following manner. Let the leftmost processor 
in the the array count up to logn processors. If this location is /3, then we can estimate p to be within a 
constant factor of (n//3) ■ logn). This procedure requires more elaboration. 

To find the leftmost processor, we can let every processor that succeeded in writing to a location in 
the first step traverse left in a synchronous fashion. Once it encounters a location that has been already 
written into, it terminates the traversal. Only the leftmost processor will be able to continue and reach 
location 1. 

In the second phase when it counts up to logn processors, we have to ensure that the processors that 
conflict in the random choices are counted with the proper multiplicity. If j processors write to a single 
location, then we let these processors increase a counter and the final count is the number of the processors - 
since j < log n w.h.p., this step completes in 0(log n) time. So the leftmost processor can start its rightward 
traversal after O(logn) steps and report the location (3. Note that the expected value of j3 is (n/»)logn 
which can be subsumed in the overall time for a problem like sorting but not for prefix computation. So, 
for prefix computation, we can assume that p < (n/ logn) and run the estimation procedure in an array of 
length n/logn, so that the overall time for the estimation is 0((n/logn) • (1/p) ■ logn) which is 0(n/p). 

The processors that belong to the block (of size (n/p)logn) must be assigned a unique id which can 
be computed as follows. Once p is estimated, each processor knows its most significant logp — log logn 
bits of the id. The remaining log log n bits can be assigned on the basis of a simple prefix sum within the 
block carried out by the leftmost processor within the block (once p is estimated, the leftmost processor 
in a block is easily identified by repeating the procedure within the block). Following this, each processor 
is assigned a unique block of 9(n/p) elements. 

In the above procedure, we have not accounted for the block misses when processors conflict in writing 
to a specific block. For instance, if j processors write to the same block, the block misses will be 0(j' 2 ). 
This could be as much as f2(log 2 n) since we can only bound j < logn with high probability. The overall 
block misses will be given by Yl7=i n i where n! = n/B and rii is the number of processors writing in block 
i. Note that ^ n« = p. We can compute the expected block misses as follows. Since rij denotes the 
number of processors that chose block j, we are interested in the quantity -E^X^n 2 ] that represents the 
total expected block misses. Let r.v. Xi = 1 if processor ^] writes to block 1 and otherwise (the same 
analysis will apply to any fixed block by symmetry). Then -EfXj] = 1/n' and moreover -E[X 2 ] = 1/n'. 
From our earlier notation, n\ = X*?=i an< ^ so n i = Y^=i Xf. Taking expectations, 

E \ fe 1 = vE\Xf\ + Q • 2E[X t . Xj ]=p-±j+ p(p - 1) • E[Xi] ■E[X j ]=p-±j+ p(p - 1)± 

The first equality follows from linearity and the second from independence. Therefore, from symmetry, 
E[n\ + n 2 + • ■ •] = n' • (p/n' + p(p — l)/n /2 ) = p(l + (p — l)/n') which is 0{p) for p < n' = n/B. 

For prefix computation, we proceed as follows. Given p < n/logn processors, we first estimate p 
using the first n/logn locations (i.e. n'/logn blocks) of the array and also the processor id. If there are 
clogn processors for a block of size n/plogn, then a processor with id = < m',£' > where ml denotes 
the most significant logp — log logn bits, and £' denotes the least significant bits is assigned the locations 
[a ■ m! + /3 • £' . . . a ■ m! + /3(£' + 1)] where a = n/plogn and f3 = - ■ (n/p). 

The first phase takes 0(n/p) sequential steps, following which the number of data items is p. We can 
run the optimal (cache oblivious) speed-up prefix computation on the p processors. 

10 Future Work 

For both sorting and convex hull algorithm given here, running time is not optimal when we have n cores. 
The bottleneck step in these algorithm is where we divide input into buckets. This step is being imple- 

6 this is only for the analysis since a processor is not explicitly given any id; also note that we need Concurrent Write 
capability for this 
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mented using a deterministic algorithm. We can try to optimize this by using a randomized approach. 
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